Improving Simultaneous Machine Translation with Monolingual Data

نویسندگان

چکیده

Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural (NMT) model. However, there still significant performance gap between NMT and SiMT. In this work, we propose to leverage monolingual data improve SiMT, which trains SiMT student on the combination of bilingual external distilled by Seq-KD. Preliminary experiments En-Zh En-Ja news domain corpora demonstrate that can significantly quality (e.g., +3.15 BLEU En-Zh). Inspired behavior human simultaneous interpreters, novel sampling strategy for considering both chunk length monotonicity. Experimental results show our consistently outperforms random (and other conventional typical strategies) avoiding key problem -- hallucination, has better scalability. We achieve +0.72 improvements average against En-Ja. Data codes be found at https://github.com/hexuandeng/Mono4SiMT.

برای دانلود باید عضویت طلایی داشته باشید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improving Neural Machine Translation Models with Monolingual Data

Neural Machine Translation (NMT) has obtained state-of-the art performance for several language pairs, while only using parallel data for training. Monolingual data plays an important role in boosting fluency for phrase-based statistical machine translation, and we investigate the use of monolingual data for neural machine translation (NMT). In contrast to previous work, which integrates a sepa...

متن کامل

Improving Statistical Machine Translation with Monolingual Collocation

This paper proposes to use monolingual collocations to improve Statistical Machine Translation (SMT). We make use of the collocation probabilities, which are estimated from monolingual corpora, in two aspects, namely improving word alignment for various kinds of SMT systems and improving phrase table for phrase-based SMT. The experimental results show that our method improves the performance of...

متن کامل

Machine Translation - 09: Monolingual Data

4.2.1 Balancing the LM and TM In order for the decoder to flexibly balance the input from the LM and TM, we augment the decoder with a “controller” mechanism. The need to flexibly balance the signals arises depending on the work being translated. For instance, in the case of Zh-En, there are no Chinese words that correspond to articles in English, in which case the LM may be more informative. O...

متن کامل

MonoTrans: Statistical Machine Translation from Monolingual Data

We present MonoTrans, a statistical machine translation system which only uses monolingual source language and target language data, without using any parallel corpora or language-specific rules. It translates each source word by the most similar target word, according to a combination of a string similarity measure and a word frequency similarity measure. It is designed for translation between...

متن کامل

Improving Translation Model by Monolingual Data

We use target-side monolingual data to extend the vocabulary of the translation model in statistical machine translation. This method called “reverse self-training” improves the decoder’s ability to produce grammatically correct translations into languages with morphology richer than the source language esp. in small-data setting. We empirically evaluate the gains for several pairs of European ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

ژورنال

عنوان ژورنال: Proceedings of the ... AAAI Conference on Artificial Intelligence

سال: 2023

ISSN: ['2159-5399', '2374-3468']

DOI: https://doi.org/10.1609/aaai.v37i11.26497